Characterizing the Use of Program Vulnerability Factors for Studying Transient Fault Tolerance in Multi-core Architectures
Abstract
Semiconductor transient faults (soft errors) are a critical design concern for the reliability of computer systems. Most recent architecture research uses performance models to produce Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than performing detailed fault injection into hardware RTL models. While AVF analysis supports the investigation of new fault-tolerant architecture techniques, it largely ignores program execution characteristics when determining periods of soft-error susceptibility. The primary problem with AVF is that software periods of vulnerability differ substantially from micro-architectural periods of vulnerability. As research increasingly seeks ways to selectively enable software-based transient fault tolerance mechanisms, both runtime and off-line experimental techniques must be guided equally by program behavior and by hardware. To address these issues with AVF, as well as the efficiency of fault injection studies, we examine elements of the Program Vulnerability Factor (PVF) in the context of multi-core architectures. PVF was previously introduced to capture program behavior in the form of memory/register vulnerability; we extend that work with static and profile-based techniques. Leveraging PVF, we make several initial contributions to computer architecture research. First, we demonstrate that a more efficient fault injection campaign can be constructed and that the outcome of fault injections during application execution can be accurately predicted. Second, compiler optimizations can be applied to better understand how the compiler affects fault susceptibility and program behavior. Finally, we motivate the need for a PVF metric for program data that is communicated between cores.
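To make the PVF idea concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the authors' implementation. It assumes that the PVF of an architectural resource R is the fraction of (bit, dynamic instruction) slots in R that are ACE (required for architecturally correct execution), and it approximates ACE-ness for the register file with backward liveness over an executed trace: a register counts as ACE just before instruction i if its current value is read later before being overwritten, and all of its bits are then treated as ACE. The same liveness data illustrates the fault-injection claim: a single-bit flip in a register that is dead at that point is overwritten before use, so it is predicted to be masked and a campaign only needs to inject into live (instruction, register) sites. Further assumptions: faults strike registers only, values left in registers at the end of the trace are dead, and the names `Instr`, `ins`, `live_before`, `register_pvf`, `injection_targets`, and the five-instruction trace are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One dynamic instruction: the architectural registers it reads and writes."""
    reads: frozenset
    writes: frozenset


def live_before(trace):
    """Backward liveness over an executed (dynamic) trace.

    Entry i is the set of registers whose current value will be read later
    before being overwritten, observed just before instruction i executes.
    Assumption: values still in registers at the end of the trace are dead.
    """
    live = set()
    result = [None] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        # A write kills the old value; a read makes the incoming value live.
        live = (live - trace[i].writes) | trace[i].reads
        result[i] = live
    return result


def register_pvf(trace, registers):
    """Approximate register-file PVF for the given trace.

    Simplification: every bit of a live register is counted as ACE, so the
    register width cancels out of the ratio ACE slots / total slots.
    """
    regs = set(registers)
    ace_slots = sum(len(live & regs) for live in live_before(trace))
    return ace_slots / (len(regs) * len(trace))


def injection_targets(trace, registers):
    """Partition (instruction, register) fault sites by predicted outcome.

    A flip in a register that is dead before instruction i is overwritten
    before use, so it is predicted masked and need not be injected.
    """
    inject, masked = [], []
    for i, live in enumerate(live_before(trace)):
        for reg in registers:
            if reg in live:
                inject.append((i, reg))
            else:
                masked.append((i, reg))
    return inject, masked


def ins(reads=(), writes=()):
    """Hypothetical helper to build an Instr from plain iterables."""
    return Instr(frozenset(reads), frozenset(writes))


trace = [
    ins(writes={"r1"}),                      # r1 <- mem[a]
    ins(writes={"r2"}),                      # r2 <- mem[b]
    ins(reads={"r1", "r2"}, writes={"r3"}),  # r3 <- r1 + r2
    ins(writes={"r1"}),                      # r1 <- 0 (old r1 value is dead)
    ins(reads={"r3"}),                       # mem[c] <- r3
]
regs = ["r1", "r2", "r3", "r4"]

print(register_pvf(trace, regs))  # 0.25
inject, masked = injection_targets(trace, regs)
print(len(inject), len(masked))   # 5 15
```

Here only 5 of the 20 register/instruction fault sites need to be injected; the other 15 are predicted masked from liveness alone. Counting whole registers is conservative; a finer analysis could track individual bits and values consumed only by dynamically dead instructions.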
Publication date: 2009